Spatial Ecology and Macroecology

Practical - Week 1

Florencia Grattarola, Friederike Wölke, Gabriel Ortega

(Department of Spatial Sciences)

2023-10-02

What are we going to see today?

  1. Data types
  2. Data sources
  3. Open data
  4. Exercise 1: explore data sources
  5. Exercise 2: data download and cleaning in R

Data types

1. Data types

Data that can place a particular taxon in a particular location and time can take many forms.

1. Data types

Opportunistic incidence records

  • presence-only data from museums, herbarium collections or citizen-science initiatives
  • single species, spatio-temporally specific, unknown absences

PROS: huge amounts of data available, easily aggregated
CONS: often without details of effort/method, wide variation in data quality

1. Data types

Presence-absence data

  • data from inventories, checklists, atlases, acoustic sensors, DNA sampling or camera-trap surveys
  • multiple species, spatio-temporally specific, report searches that did not find the species (absences)

PROS: absences are informative, area and effort are measured
CONS: less abundant (too time consuming), methods are species-specific

1. Data types

Repeated surveys

  • monitoring schemes, repeated atlas projects
  • multiple species, over time, spatially defined, use a standardized protocol

PROS: standardised protocols, multiple points in time
CONS: expensive, so geographically restricted and usually temporally restricted too

1. Data types

Range-maps

  • outlines of species distributions, IUCN ranges, field guides
  • single species, expert-drawn

PROS: rough estimates of the outer boundaries of areas within which species are likely to occur
CONS: large spatial and temporal uncertainties

1. Data types

Data can also be classified by how they were collected.

1. Data types

Structured

  • clear survey design (location, target) and standardised sampling protocol
  • site selection: preselected locations, sometimes stratified random
  • metadata: informs about the survey methods

1. Data types

Semi-structured

  • no survey design, but a minimally standardised sampling protocol
  • site selection: free
  • metadata: informs about the observation process and survey methods

1. Data types

Unstructured (opportunistic)

  • no survey design and no standardised sampling protocol
  • site selection: free
  • metadata: almost none

1. Data types

Finally, data can also be classified by how they are made available to others.

1. Data types

Disaggregated

  • precision is high, but completeness and representativeness are low.

1. Data types

Aggregated

  • precision is low, but completeness and representativeness are high.
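
A minimal base-R sketch of this trade-off. The coordinates, species labels, the 1-degree cell size and all object names are made up for illustration: disaggregated records keep exact positions, while aggregation summarises them per grid cell and loses the exact location.

```r
# Hypothetical point records (disaggregated): exact coordinates per observation
records <- data.frame(
  species = c("sp_a", "sp_a", "sp_b", "sp_b", "sp_b"),
  lon     = c(14.42, 14.47, 16.61, 16.58, 15.05),
  lat     = c(50.09, 50.03, 49.20, 49.25, 49.74)
)

# Assign each record to the grid cell whose lower-left corner is the
# floored coordinate (cell size is an arbitrary choice for this sketch)
cell_size <- 1
records$cell_lon <- floor(records$lon / cell_size) * cell_size
records$cell_lat <- floor(records$lat / cell_size) * cell_size

# Aggregated view: one row per species and cell, with a record count
aggregated <- aggregate(
  n ~ species + cell_lon + cell_lat,
  data = transform(records, n = 1),
  FUN  = sum
)
aggregated
```

The aggregated table has fewer rows than the original records and only cell-level positions, which is the loss of precision described above.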

2. Data sources

gbif.org

GBIF is an international network and data infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.

rgbif: https://github.com/ropensci/rgbif

obis.org

OBIS is a global open-access data and information clearing-house on marine biodiversity for science, conservation and sustainable development.

robis: https://github.com/iobis/robis

ebird.org

eBird’s goal is to gather birdwatchers’ knowledge and experience in the form of checklists of birds, archive it, and freely share it to power new data-driven approaches to science, conservation and education.

auk: https://cornelllabofornithology.github.io/auk/

rebird: https://github.com/ropensci/rebird

inaturalist.org

iNaturalist is one of the world’s most popular nature apps. It allows participants to contribute observations of any organism, or traces thereof, along with associated spatio-temporal metadata.

rinat: https://github.com/ropensci/rinat

mol.org

Map of Life endeavors to provide ‘best-possible’ species range information and species lists for any geographic area. The Map of Life assembles and integrates different sources of data describing species distributions worldwide.

iucnredlist.org

IUCN’s (International Union for Conservation of Nature) Red List of Threatened Species has evolved to become the world’s most comprehensive information source on the global extinction risk status of animal, fungus and plant species.

rredlist: https://github.com/ropensci/rredlist

redlistr: https://github.com/red-list-ecosystem/redlistr

bien.nceas.ucsb.edu/bien/

BIEN is a network of ecologists, botanists, and computer scientists working together to document global patterns of plant diversity, function and distribution.

rbien: https://github.com/bmaitner/RBIEN

sibbr.gov.br

SiBBr (Brazilian Biodiversity Information System) is an online platform that integrates data and information about biodiversity and ecosystems from different sources, making them accessible for different uses.

sibbr: https://github.com/sibbr

bto.org/our-science/projects/breeding-bird-survey

BBS (Breeding Bird Survey) involves thousands of volunteer birdwatchers carrying out standardised annual bird counts on randomly located 1-km squares. It’s part of the NBN Atlas.

biotime.st-andrews.ac.uk

BioTIME is an open-access global database of assemblage time series for quantifying and understanding biodiversity change.

BioTime Hub: https://github.com/bioTIMEHub

nhm.ac.uk/our-science/our-work/biodiversity/predicts

PREDICTS uses data on local biodiversity around the world to model how human activities affect biological communities. This biodiversity change is shown as the Biodiversity Intactness Index (BII).

3. Open Data

Open means anyone can freely access, use, modify, and share for any purpose.


3. Open Data: Data standards

Darwin Core is the internationally agreed data standard to facilitate the sharing of information about biological diversity.

dwc.tdwg.org

countryCode: The standard code for the country in which the Location occurs. Recommended best practice is to use an ISO 3166-1-alpha-2 country code.
recordedBy: A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence.
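
As a rough illustration of what these two terms imply in practice, here is a base-R sketch that checks their format on made-up records. The names, the invalid values and the " | " separator are assumptions for the example; the check only tests the shape of the code, not whether the country exists.

```r
# Hypothetical occurrence records using two Darwin Core terms
occ <- data.frame(
  countryCode = c("CZ", "cz", "Czechia"),
  recordedBy  = c("A. Novak | B. Svoboda",
                  "C. Dvorak",
                  "D. Cerny | E. Prochazka | F. Kucera"),
  stringsAsFactors = FALSE
)

# countryCode: expect exactly two upper-case letters (ISO 3166-1 alpha-2 shape)
occ$valid_country_code <- grepl("^[A-Z]{2}$", occ$countryCode)

# recordedBy: a delimited list; split on the assumed " | " separator
recorders <- strsplit(occ$recordedBy, " | ", fixed = TRUE)
occ$n_recorders <- lengths(recorders)

occ[, c("countryCode", "valid_country_code", "n_recorders")]
```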

3. Open Data: Licensing

Open data are licensed under open licenses. Some examples:


CC0: Public domain


CC-BY: Attribution


CC-BY-NC: Attribution - Non Commercial


CC-BY-SA: Attribution - Share Alike
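
A small sketch of how the licence of each record could guide reuse. The record ids and the short licence labels are made up for illustration (real GBIF records carry a licence URL instead); the conditions table only restates the four licences listed above.

```r
# Hypothetical records, each tagged with one of the licences listed above
records <- data.frame(
  id      = 1:5,
  licence = c("CC0", "CC-BY", "CC-BY-NC", "CC-BY", "CC-BY-SA"),
  stringsAsFactors = FALSE
)

# Conditions attached to each licence
conditions <- data.frame(
  licence        = c("CC0", "CC-BY", "CC-BY-NC", "CC-BY-SA"),
  attribution    = c(FALSE, TRUE, TRUE, TRUE),   # must credit the source
  commercial_use = c(TRUE, TRUE, FALSE, TRUE),   # commercial reuse allowed
  share_alike    = c(FALSE, FALSE, FALSE, TRUE), # derivatives keep the licence
  stringsAsFactors = FALSE
)

# Attach the conditions to every record and list those usable commercially
merged <- merge(records, conditions, by = "licence")
commercial_ids <- sort(merged$id[merged$commercial_use])
commercial_ids
```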

3. Open Data: Data sharing

Data that are standardized and have an open licence can be shared :)

EXERCISE 1 Explore different data sources...

Imagine you want to start a project:

Choose a taxon, choose one data source and try to get distribution data.

Then answer the following 4 questions:

  • What kind of data types does the source provide?

  • Which kinds of taxa does the database generally cover?

  • How accessible is the data? Can anyone download it? Restrictions?

  • What was your experience? What issues did you encounter while getting the data?

EXERCISE 2 Mammals of the Czech Republic

We will use the mammals of the Czech Republic as an example dataset. We will access data through GBIF using tools available in R.

Some preparation before starting to code

  • Create a new project for all your practical sessions (with a Scripts and Data folder inside).
  • Comment your code as much as possible, as if you were to explain it to others (that other could be you in 3 months!).
  • Keep your code short and easily readable in plain English.

File > New project > New directory or Existing directory
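
The two folders can also be created from the console. This is a sketch only: `project_dir` is a hypothetical location in a temporary directory, to be replaced by your own project path.

```r
# Hypothetical project location; replace with the path of your own project
project_dir <- file.path(tempdir(), "spatial_ecology_practicals")

# Create the Scripts and Data folders (silently skipped if they already exist)
for (folder in c("Scripts", "Data")) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Check what was created
list.files(project_dir)
```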

Some preparation before starting to code

  • Install the package tidyverse.
install.packages('tidyverse') # install
library(tidyverse) # load


We will be using many functions from this package, like filter(), mutate(), and later read_csv().

Some preparation before starting to code

We will use rgbif.

First, we’ll need to install the package.

install.packages('rgbif')

To use it, we load the library and check it’s working.

library(rgbif)
packageVersion('rgbif')
[1] '3.7.3'

Some preparation before starting to code

We will need the GBIF backbone taxon ID (taxonKey) for the Mammalia class. For that we will use another package called taxize.

install.packages('taxize')
library(taxize)
packageVersion('taxize')
[1] '0.9.100'

Some preparation before starting to code

Finally, we will install and load sf, which is a super useful library for mapping and spatial data analyses.

install.packages('sf')
library(sf)
packageVersion('sf')
[1] '1.0.8'
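
To avoid reinstalling packages that are already present, one option is to check first what is missing. A minimal sketch: the package list is taken from this practical, and rnaturalearth is included because it is used later for the base map.

```r
# Packages used in this practical
needed <- c("tidyverse", "rgbif", "taxize", "sf", "rnaturalearth")

# Which of them are not yet installed in the current library?
missing_pkgs <- setdiff(needed, rownames(installed.packages()))
missing_pkgs

# Uncomment to install only what is missing
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```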

4. Data download through R

  • Create some variables that will be used later.
taxa <- "Mammalia"
country_code <- "CZ" # Two letters ISO code for Czechia
proj_crs <- 4326 # EPSG code for WGS84

4. Data download through R

So, let’s get the taxon ID for the Mammalia class

taxon_key <- get_gbifid_(taxa) %>%
  bind_rows() %>% # Transform the result of get_gbifid into a dataframe
  filter(matchtype == "EXACT" & status == "ACCEPTED") %>% # Filter the dataframe by the columns "matchtype" and "status"
  pull(usagekey) # Pull the contents of the column "usagekey"
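
To see what the filter and pull steps do without querying GBIF, here is a base-R sketch on a made-up match table. Only the key 359 for Mammalia comes from the output shown later in this practical; the other rows and keys are hypothetical.

```r
# Hypothetical match table imitating the columns used in the pipeline above
matches <- data.frame(
  usagekey       = c(359L, 900001L, 900002L),
  scientificname = c("Mammalia", "Mammalia", "Mammaliaformes"),
  matchtype      = c("EXACT", "EXACT", "FUZZY"),
  status         = c("ACCEPTED", "SYNONYM", "ACCEPTED"),
  stringsAsFactors = FALSE
)

# Keep only exact matches that are accepted names ...
keep <- matches$matchtype == "EXACT" & matches$status == "ACCEPTED"

# ... and extract the key column as a plain vector
taxon_key_toy <- matches$usagekey[keep]
taxon_key_toy
```

If more than one row survives the filter, the resulting vector has several keys, so it is worth checking its length before passing it on.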

4. Data download through R

  • Get a base map that you could use later for plotting or checking the dataset.
base_map <- rnaturalearth::ne_countries( # requires the rnaturalearth package to be installed
  scale = 110,
  type = "countries",
  country = "czechia",
  returnclass = "sf"
)

4. Data download through R

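And now we can use the function occ_count() to find out the number of occurrence records for the entire Czech Republic. The usage below lists its arguments and their defaults; it is a reference, not the call we will run.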

occ_count(
  taxonKey = NULL,
  georeferenced = NULL,
  basisOfRecord = NULL,
  datasetKey = NULL,
  date = NULL,
  typeStatus = NULL,
  country = NULL,
  year = NULL,
  from = 2000,
  to = 2012,
  type = "count",
  publishingCountry = "US",
  protocol = NULL,
  curlopts = list()
)

4. Data download through R

How many occurrence records are in GBIF for the entire Czech Republic?

occ_count(country=country_code) # country code for Czech Republic (https://countrycode.org/)
[1] 3425603


And how many records for the mammals of the Czech Republic?

occ_count(
  country = country_code,
  taxonKey = taxon_key
)
[1] 6366
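
The two counts can be put in proportion with a line of arithmetic. The numbers are copied from the outputs above and will differ if the queries are run again later.

```r
records_cz      <- 3425603  # all GBIF occurrence records for CZ (output above)
records_mammals <- 6366     # mammal records for CZ (output above)

# Share of mammal records among all records, in percent
share_percent <- 100 * records_mammals / records_cz
round(share_percent, 2)
```

Mammals therefore make up just under 0.2 % of the records for the country.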


We are ready to do a download. Whoop!

4. Data download through R

To do this, we will use occ_search(). The usage below lists its arguments and their defaults.

occ_search(
  taxonKey = NULL,
  scientificName = NULL,
  country = NULL,
  publishingCountry = NULL,
  hasCoordinate = NULL,
  typeStatus = NULL,
  recordNumber = NULL,
  lastInterpreted = NULL,
  continent = NULL,
  geometry = NULL,
  geom_big = "asis",
  geom_size = 40,
  geom_n = 10,
  recordedBy = NULL,
  recordedByID = NULL,
  identifiedByID = NULL,
  basisOfRecord = NULL,
  datasetKey = NULL,
  eventDate = NULL,
  catalogNumber = NULL,
  year = NULL,
  month = NULL,
  decimalLatitude = NULL,
  decimalLongitude = NULL,
  elevation = NULL,
  depth = NULL,
  institutionCode = NULL,
  collectionCode = NULL,
  hasGeospatialIssue = NULL,
  issue = NULL,
  search = NULL,
  mediaType = NULL,
  subgenusKey = NULL,
  repatriated = NULL,
  phylumKey = NULL,
  kingdomKey = NULL,
  classKey = NULL,
  orderKey = NULL,
  familyKey = NULL,
  genusKey = NULL,
  establishmentMeans = NULL,
  protocol = NULL,
  license = NULL,
  organismId = NULL,
  publishingOrg = NULL,
  stateProvince = NULL,
  waterBody = NULL,
  locality = NULL,
  limit = 500,
  start = 0,
  fields = "all",
  return = NULL,
  facet = NULL,
  facetMincount = NULL,
  facetMultiselect = NULL,
  skip_validate = TRUE,
  curlopts = list(),
  ...
)

4. Data download through R

Get occurrence records of mammals from the Czech Republic.

occ_search(taxonKey=taxon_key,
           country='CZ') 
Records found [6366] 
Records returned [500] 
No. unique hierarchies [38] 
No. media records [500] 
No. facets [0] 
Args [occurrenceStatus=PRESENT, limit=500, offset=0, taxonKey=359, country=CZ,
     fields=all] 
# A tibble: 500 × 98
   key    scien…¹ decim…² decim…³ issues datas…⁴ publi…⁵ insta…⁶ hosti…⁷ publi…⁸
   <chr>  <chr>     <dbl>   <dbl> <chr>  <chr>   <chr>   <chr>   <chr>   <chr>  
 1 40115… Dama d…    49.2    16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 2 40116… Castor…    50.2    14.6 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 3 40150… Myocas…    49.7    15.1 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 4 40181… Myocas…    50.1    14.4 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 5 40149… Sus sc…    49.2    16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 6 40149… Dama d…    49.2    16.5 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 7 40149… Capreo…    49.6    16.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 8 40149… Lepus …    49.6    16.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
 9 40149… Myocas…    50.1    14.6 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
10 40148… Myocas…    49.8    14.7 cdc,c… 50c950… 28eb1a… 997448… 28eb1a… US     
# … with 490 more rows, 88 more variables: protocol <chr>, lastCrawled <chr>,
#   lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
#   occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
#   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
#   speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>,
#   kingdom <chr>, phylum <chr>, order <chr>, family <chr>, genus <chr>,
#   species <chr>, genericName <chr>, specificEpithet <chr>, taxonRank <chr>, …

Check the data output. What’s the format? How many rows does it have?
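
The output above reports 6366 records found but only 500 returned, because limit defaults to 500. A short sketch of the arithmetic behind paging through the full result, assuming the start argument of occ_search() acts as the record offset (the Args line above shows offset=0):

```r
total_records <- 6366  # "Records found" in the output above
page_size     <- 500   # default value of `limit`

# Number of pages needed to cover all records
n_pages <- ceiling(total_records / page_size)

# Offset of the first record on each page
offsets <- seq(from = 0, by = page_size, length.out = n_pages)

n_pages
range(offsets)

# Records on the last, partially filled page
total_records - max(offsets)
```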

4. Data download through R

Get (almost) all occurrence records of mammals from the Czech Republic: with limit=6000 we retrieve 6000 of the 6366 records found.

occ_search(taxonKey=taxon_key,
           country='CZ',
           limit=6000)


Finally, we store the result in the object mammalsCZ.

mammalsCZ <- occ_search(
  taxonKey = taxon_key, # Key 359 created previously
  country = country_code, # CZ, ISO code of Czechia
  limit = 6000, # Max number of records to download
  hasGeospatialIssue = FALSE # Only records without spatial issues
)

mammalsCZ <- mammalsCZ$data # The output of occ_search is a list with a data object inside. Here we pull the data out of the list.
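
A toy illustration of that last step, using a hand-made list rather than a real query. The list below is a hypothetical stand-in with only two elements; the element names meta and data are assumed to mirror the real result.

```r
# Hypothetical stand-in for the list returned by occ_search()
result <- list(
  meta = list(count = 6366, limit = 6000),
  data = data.frame(
    species = c("Dama dama", "Castor fiber"),
    year    = c(2023L, 2023L),
    stringsAsFactors = FALSE
  )
)

# The list has named elements; the records live in `data`
names(result)

# Pull the table out of the list, as done with mammalsCZ$data
records <- result$data
class(records)
nrow(records)
```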

4. Data download through R

Mammals occurrence records from the Czech Republic

glimpse(mammalsCZ)
Rows: 6,000
Columns: 177
$ key                                <chr> "4011579235", "4011687250", "401505…
$ scientificName                     <chr> "Dama dama (Linnaeus, 1758)", "Cast…
$ decimalLatitude                    <dbl> 49.19989, 50.21619, 49.73967, 50.08…
$ decimalLongitude                   <dbl> 16.52097, 14.64081, 15.08824, 14.41…
$ issues                             <chr> "cdc,cdround", "cdc,cdround", "cdc,…
$ datasetKey                         <chr> "50c9509d-22c7-4a22-a47d-8c48425ef4…
$ publishingOrgKey                   <chr> "28eb1a3f-1c15-4a95-931a-4af90ecb57…
$ installationKey                    <chr> "997448a8-f762-11e1-a439-00145eb45e…
$ hostingOrganizationKey             <chr> "28eb1a3f-1c15-4a95-931a-4af90ecb57…
$ publishingCountry                  <chr> "US", "US", "US", "US", "US", "US",…
$ protocol                           <chr> "DWC_ARCHIVE", "DWC_ARCHIVE", "DWC_…
$ lastCrawled                        <chr> "2023-09-28T05:00:54.279+00:00", "2…
$ lastParsed                         <chr> "2023-09-28T12:16:10.557+00:00", "2…
$ crawlId                            <int> 399, 399, 399, 399, 399, 399, 399, …
$ basisOfRecord                      <chr> "HUMAN_OBSERVATION", "HUMAN_OBSERVA…
$ occurrenceStatus                   <chr> "PRESENT", "PRESENT", "PRESENT", "P…
$ taxonKey                           <int> 5220136, 4409131, 4264680, 4264680,…
$ kingdomKey                         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ phylumKey                          <int> 44, 44, 44, 44, 44, 44, 44, 44, 44,…
$ classKey                           <int> 359, 359, 359, 359, 359, 359, 359, …
$ orderKey                           <int> 731, 1459, 1459, 1459, 731, 731, 73…
$ familyKey                          <int> 5298, 5493, 3240572, 3240572, 5302,…
$ genusKey                           <int> 8397832, 3240758, 3240573, 3240573,…
$ speciesKey                         <int> 5220136, 4409131, 4264680, 4264680,…
$ acceptedTaxonKey                   <int> 5220136, 4409131, 4264680, 4264680,…
$ acceptedScientificName             <chr> "Dama dama (Linnaeus, 1758)", "Cast…
$ kingdom                            <chr> "Animalia", "Animalia", "Animalia",…
$ phylum                             <chr> "Chordata", "Chordata", "Chordata",…
$ order                              <chr> "Artiodactyla", "Rodentia", "Rodent…
$ family                             <chr> "Cervidae", "Castoridae", "Myocasto…
$ genus                              <chr> "Dama", "Castor", "Myocastor", "Myo…
$ species                            <chr> "Dama dama", "Castor fiber", "Myoca…
$ genericName                        <chr> "Dama", "Castor", "Myocastor", "Myo…
$ specificEpithet                    <chr> "dama", "fiber", "coypus", "coypus"…
$ taxonRank                          <chr> "SPECIES", "SPECIES", "SPECIES", "S…
$ taxonomicStatus                    <chr> "ACCEPTED", "ACCEPTED", "ACCEPTED",…
$ iucnRedListCategory                <chr> "LC", "LC", "LC", "LC", "LC", "LC",…
$ dateIdentified                     <chr> "2023-01-01T19:17:07", "2023-01-02T…
$ coordinateUncertaintyInMeters      <dbl> 31, 130, 31, 31, 15, 61, 31, 15, 77…
$ continent                          <chr> "EUROPE", "EUROPE", "EUROPE", "EURO…
$ stateProvince                      <chr> "Jihomoravský", "Středočeský", "Stř…
$ year                               <int> 2023, 2023, 2023, 2023, 2023, 2023,…
$ month                              <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day                                <int> 1, 1, 1, 5, 4, 4, 3, 3, 6, 8, 2, 5,…
$ eventDate                          <chr> "2023-01-01T14:40:23", "2023-01-01T…
$ modified                           <chr> "2023-01-02T04:44:01.000+00:00", "2…
$ lastInterpreted                    <chr> "2023-09-28T12:16:10.557+00:00", "2…
$ references                         <chr> "https://www.inaturalist.org/observ…
$ license                            <chr> "http://creativecommons.org/license…
$ identifier                         <chr> "145580826", "145674501", "14584810…
$ facts                              <chr> "none", "none", "none", "none", "no…
$ relations                          <chr> "none", "none", "none", "none", "no…
$ isInCluster                        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
$ datasetName                        <chr> "iNaturalist research-grade observa…
$ recordedBy                         <chr> "Marilena Wilding", "slepice_s_fota…
$ identifiedBy                       <chr> "grigorenko", "Lefebvre Maxence", "…
$ geodeticDatum                      <chr> "WGS84", "WGS84", "WGS84", "WGS84",…
$ class                              <chr> "Mammalia", "Mammalia", "Mammalia",…
$ countryCode                        <chr> "CZ", "CZ", "CZ", "CZ", "CZ", "CZ",…
$ recordedByIDs                      <chr> "none", "none", "none", "none", "no…
$ identifiedByIDs                    <chr> "none", "none", "none", "none", "no…
$ country                            <chr> "Czechia", "Czechia", "Czechia", "C…
$ rightsHolder                       <chr> "Marilena Wilding", "slepice_s_fota…
$ identifier.1                       <chr> "145580826", "145674501", "14584810…
$ http...unknown.org.nick            <chr> "marilena_wilding", "slepice_s_fota…
$ verbatimEventDate                  <chr> "2023-01-01 14:40:23", "2023-01-01 …
$ verbatimLocality                   <chr> "Stará dálnice, 641 00 Brno-Brno-Že…
$ collectionCode                     <chr> "Observations", "Observations", "Ob…
$ gbifID                             <chr> "4011579235", "4011687250", "401505…
$ occurrenceID                       <chr> "https://www.inaturalist.org/observ…
$ taxonID                            <chr> "42161", "43793", "43997", "43997",…
$ catalogNumber                      <chr> "145580826", "145674501", "14584810…
$ institutionCode                    <chr> "iNaturalist", "iNaturalist", "iNat…
$ eventTime                          <chr> "14:40:23+01:00", "12:21:24+01:00",…
$ occurrenceRemarks                  <chr> "Observed in national park Obora Ho…
$ http...unknown.org.captive         <chr> "wild", "wild", "wild", "wild", "wi…
$ identificationID                   <chr> "324197342", "324439694", "32494044…
$ name                               <chr> "Dama dama (Linnaeus, 1758)", "Cast…
$ recordedByIDs.type                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ recordedByIDs.value                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ informationWithheld                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lifeStage                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ infraspecificEpithet               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ identifiedByIDs.type               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ identifiedByIDs.value              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ individualCount                    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ vernacularName                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locality                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ higherClassification               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ recordNumber                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ dynamicProperties                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxonConceptID                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.taxonRankID     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ identificationVerificationStatus   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxonRemarks                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ distanceFromCentroidInMeters       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ identificationRemarks              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ sex                                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ samplingProtocol                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ dataGeneralizations                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ datasetID                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ language                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ accessRights                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ eventID                            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ projectId                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ organismQuantity                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ organismQuantityType               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ otherCatalogNumbers                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ gadm                               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ associatedSequences                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ networkKeys                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ coordinatePrecision                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ institutionKey                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ acceptedNameUsage                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locationRemarks                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ collectionKey                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ preparations                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ institutionID                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ nomenclaturalCode                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferencedBy                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ type                               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ disposition                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bibliographicCitation              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ collectionID                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ elevation                          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ elevationAccuracy                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.language        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fieldNumber                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimIdentification             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locationAccordingTo                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferencedDate                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ higherGeography                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceProtocol               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ footprintWKT                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceVerificationStatus     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ endDayOfYear                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimCoordinateSystem           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ organismID                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ previousIdentifications            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ identificationQualifier            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceSources                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ownerInstitutionCode               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ footprintSRS                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceRemarks                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locationID                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.recordID        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ county                             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rights                             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ startDayOfYear                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.recordEnteredBy <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ establishmentMeans                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ parentNameUsage                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ island                             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ materialSampleID                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ associatedReferences               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimElevation                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ higherGeographyID                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ eventRemarks                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ combinationAuthors                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimScientificName             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ namePublishedIn                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ combinationYear                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.canonicalName   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimLabel                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestEraOrHighestErathem          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestEonOrHighestEonothem         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestPeriodOrHighestSystem        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestEonOrLowestEonothem        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestEraOrLowestErathem         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestEpochOrLowestSeries        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestPeriodOrLowestSystem       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestEpochOrHighestSeries         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestAgeOrLowestStage           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ namePublishedInYear                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lithostratigraphicTerms            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimTaxonRank                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestAgeOrHighestStage            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…

4. Data download through R

Mammals occurrence records from the Czech Republic

How many records do we have?

nrow(mammalsCZ)
[1] 6000


How many species do we have?

mammalsCZ %>%
  filter(taxonRank == "SPECIES") %>%
  distinct(scientificName) %>%
  nrow()
[1] 134

distinct() is used to see unique values
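
The same count can be sketched in base R on a made-up table. The rows are illustrative; the two species names are borrowed from the outputs in this practical.

```r
# Hypothetical records: two species (one recorded twice) and one genus-level record
toy <- data.frame(
  scientificName = c("Dama dama (Linnaeus, 1758)",
                     "Dama dama (Linnaeus, 1758)",
                     "Myotis myotis (Borkhausen, 1797)",
                     "Myotis"),
  taxonRank = c("SPECIES", "SPECIES", "SPECIES", "GENUS"),
  stringsAsFactors = FALSE
)

# Keep species-level records, then count the unique names
species_only <- toy[toy$taxonRank == "SPECIES", ]
n_species <- length(unique(species_only$scientificName))
n_species
```

Repeated names are counted once, and the genus-level record is excluded before counting.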

5. Data quality

Data are not ‘good’ or ‘bad’; their quality depends on our goal.
Some things we can check:

  • Basis of the record (type of occurrence)
  • Species names (taxonomic harmonisation)
  • Spatial and temporal information (accuracy / precision)

CoordinateCleaner: https://github.com/ropensci/CoordinateCleaner

Automated flagging of common spatial and temporal errors in data.
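
To give a feel for what such flagging involves, here is a hand-rolled base-R sketch of three simple coordinate checks. It does not use the CoordinateCleaner API; the coordinates and flag names are made up for illustration.

```r
# Hypothetical coordinates, some of them deliberately suspicious
coords <- data.frame(
  decimalLongitude = c(16.52, 0, 14.64, 200, 49.2),
  decimalLatitude  = c(49.20, 0, 50.22, 50.1, 49.2)
)

lon <- coords$decimalLongitude
lat <- coords$decimalLatitude

# Flag 1: both coordinates exactly zero (often a placeholder for "missing")
coords$flag_zero <- lon == 0 & lat == 0

# Flag 2: values outside the valid range of longitude or latitude
coords$flag_range <- abs(lon) > 180 | abs(lat) > 90

# Flag 3: identical longitude and latitude (possible data-entry error)
coords$flag_equal <- lon == lat

# A record is flagged if any check fails
coords$flagged <- coords$flag_zero | coords$flag_range | coords$flag_equal
coords
```

Flagged records are candidates for inspection rather than automatic deletion; whether to drop them depends on the goal of the analysis.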

5. Data quality

As an example, we will check the following fields:

  • basisOfRecord: we want preserved specimens or observations
  • taxonRank: we want records at species level
  • coordinateUncertaintyInMeters: we want them to be smaller than 10 km

5. Data quality

  • basisOfRecord: we want preserved specimens or observations
mammalsCZ %>% distinct(basisOfRecord)
# A tibble: 7 × 1
  basisOfRecord     
  <chr>             
1 HUMAN_OBSERVATION 
2 OBSERVATION       
3 MATERIAL_SAMPLE   
4 PRESERVED_SPECIMEN
5 OCCURRENCE        
6 FOSSIL_SPECIMEN   
7 MATERIAL_CITATION 

distinct() is used to see unique values

5. Data quality

  • basisOfRecord: we want preserved specimens or observations
mammalsCZ %>%
  group_by(basisOfRecord) %>%
  count()
# A tibble: 7 × 2
# Groups:   basisOfRecord [7]
  basisOfRecord          n
  <chr>              <int>
1 FOSSIL_SPECIMEN      128
2 HUMAN_OBSERVATION   4773
3 MATERIAL_CITATION     14
4 MATERIAL_SAMPLE      128
5 OBSERVATION           75
6 OCCURRENCE             2
7 PRESERVED_SPECIMEN   880

group_by() groups the rows by the values of a variable; count() then counts the rows in each group

5. Data quality

  • basisOfRecord: we want preserved specimens or observations
mammalsCZ <- mammalsCZ %>%
  filter(basisOfRecord == "PRESERVED_SPECIMEN" |
    basisOfRecord == "HUMAN_OBSERVATION")

Note the use of | (OR) to filter the data. Another alternative is filter(basisOfRecord %in% c("PRESERVED_SPECIMEN","HUMAN_OBSERVATION")).
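
A quick base-R check that the two forms select the same records, using a made-up vector of basisOfRecord values. The expected total is taken from the counts table above.

```r
# Hypothetical basisOfRecord values
basis <- c("HUMAN_OBSERVATION", "FOSSIL_SPECIMEN", "PRESERVED_SPECIMEN",
           "OBSERVATION", "HUMAN_OBSERVATION")

# Form 1: two comparisons joined with | (OR)
keep_or <- basis == "PRESERVED_SPECIMEN" | basis == "HUMAN_OBSERVATION"

# Form 2: membership test with %in%
keep_in <- basis %in% c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION")

# Both give the same logical vector when there are no missing values
identical(keep_or, keep_in)

# Records expected after filtering, from the counts table above
expected_rows <- 4773 + 880
expected_rows
```

The %in% form scales better when the list of accepted values grows, since it avoids chaining many comparisons.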


How many records do we have now?

nrow(mammalsCZ)
[1] 5653

5. Data quality

  • taxonRank: we want records at species level
mammalsCZ %>% distinct(taxonRank)
# A tibble: 5 × 1
  taxonRank 
  <chr>     
1 SPECIES   
2 SUBSPECIES
3 GENUS     
4 ORDER     
5 FAMILY    

5. Data quality

  • taxonRank: we want records at species level
mammalsCZ <- mammalsCZ %>% 
  filter(taxonRank == 'SPECIES')


How many records do we have now?

nrow(mammalsCZ)
[1] 5301

5. Data quality

  • coordinateUncertaintyInMeters: we want them to be smaller than 10 km
mammalsCZ %>%
  filter(coordinateUncertaintyInMeters >= 10000) %>%
  select(scientificName, 
         coordinateUncertaintyInMeters, 
         stateProvince)
# A tibble: 226 × 3
   scientificName                             coordinateUncertaintyInM…¹ state…²
   <chr>                                                           <dbl> <chr>  
 1 Myotis nattereri (Kuhl, 1817)                                   26454 Středo…
 2 Myotis myotis (Borkhausen, 1797)                                26454 Středo…
 3 Myotis myotis (Borkhausen, 1797)                                26454 Středo…
 4 Myotis myotis (Borkhausen, 1797)                                26454 Středo…
 5 Rhinolophus hipposideros (Bechstein, 1800)                      26454 Středo…
 6 Rhinolophus hipposideros (Bechstein, 1800)                      26454 Středo…
 7 Myotis myotis (Borkhausen, 1797)                                26454 Středo…
 8 Barbastella barbastellus (Schreber, 1774)                       26454 Středo…
 9 Barbastella barbastellus (Schreber, 1774)                       26454 Středo…
10 Plecotus auritus (Linnaeus, 1758)                               26454 Středo…
# … with 216 more rows, and abbreviated variable names
#   ¹​coordinateUncertaintyInMeters, ²​stateProvince

5. Data quality

  • coordinateUncertaintyInMeters: we want them to be smaller than 10 km
mammalsCZ <- mammalsCZ %>% 
  filter(coordinateUncertaintyInMeters < 10000) # keep records with uncertainty below 10 km; records with NA uncertainty are dropped too


How many records do we have now?

nrow(mammalsCZ)
[1] 3914
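
The drop from 5301 to 3914 records is larger than the 226 records with an uncertainty of 10 km or more. A comparison with a missing value is neither TRUE nor FALSE, so such rows are removed as well. A base-R sketch with made-up values, followed by the accounting implied by the outputs above (taking those outputs at face value):

```r
# Hypothetical uncertainties, two of them missing
toy <- data.frame(uncertainty = c(31, 26454, NA, 130, NA))

# subset() keeps only rows where the condition is TRUE, so NA rows are dropped
kept <- subset(toy, uncertainty < 10000)
nrow(kept)

# Accounting with the numbers reported above
before     <- 5301  # records before the filter
after      <- 3914  # records after the filter
too_coarse <- 226   # records with uncertainty >= 10000 m

removed             <- before - after
missing_uncertainty <- removed - too_coarse
removed
missing_uncertainty
```

Whether records without a reported uncertainty should be discarded is a judgement call; to keep them, the condition would need an explicit is.na() alternative.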

How are the records distributed?

We’ll get to this next week :)

And finally, a simple trick to produce separate maps per order.

Any doubts?